readability assessment
Hierarchical Ranking Neural Network for Long Document Readability Assessment
Zheng, Yurui, Chen, Yijun, Zhang, Shaohong
Readability assessment aims to evaluate the reading di ffi culty of a text. In recent years, while deep learning technol - ogy has been gradually applied to readability assessment, m ost approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This pap er proposes a bidirectional readability assessment mechan ism that captures contextual information to identify regions w ith rich semantic information in the text, thereby predicti ng the readability level of individual sentences. These sente nce-level labels are then used to assist in predicting the ov erall readability level of the document. Additionally, a pairwis e sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtrac tion. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive p erformance and outperforms other baseline models. Introduction Automatic Text Readability (ARA) research originated in th e early 20th century, aiming to evaluate text reading di ffi culty and assist educators in recommending appropriate rea ding materials for learners [ 1 ]. Readability assessment approaches are generally classified into three paradig ms: human evaluation, co-selection-based analysis, and content-based analysis. Human evaluation involves expert annotation or reader surveys; co-selection methods leverage user interaction data such as reading time or choices [ 2 ]; and content-based approaches infer readability using linguistic, syntactic, or semantic features extracted fro m the text itself. Early studies predominantly relied on experts' subjective evaluations and simple statistical feat ures, such as sentence length and word complexity.
Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics
Belem, Catarina G, Glenn, Parker, Samuel, Alfy, Kumar, Anoop, Liu, Daben
Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
Lexicon-Enriched Graph Modeling for Arabic Document Readability Prediction
Elchafei, Passant, Osama, Mayar, Rageh, Mohamed, Abuelkheir, Mervat
We present a graph-based approach enriched with lexicons to predict document-level readability in Arabic, developed as part of the Constrained Track of the BAREC Shared Task 2025. Our system models each document as a sentence-level graph, where nodes represent sentences and lemmas, and edges capture linguistic relationships such as lexical co-occurrence and class membership. Sentence nodes are enriched with features from the SAMER lexicon as well as contextual embeddings from the Arabic transformer model. The graph neural network (GNN) and transformer sentence encoder are trained as two independent branches, and their predictions are combined via late fusion at inference. For document-level prediction, sentence-level outputs are aggregated using max pooling to reflect the most difficult sentence. Experimental results show that this hybrid method outperforms standalone GNN or transformer branches across multiple readability metrics. Overall, the findings highlight that fusion offers advantages at the document level, but the GNN-only approach remains stronger for precise prediction of sentence-level readability.
mucAI at BAREC Shared Task 2025: Towards Uncertainty Aware Arabic Readability Assessment
We present a simple, model-agnostic post-processing technique for fine-grained Arabic readability classification in the BAREC 2025 Shared Task (19 ordinal levels). Our method applies conformal prediction to generate prediction sets with coverage guarantees, then computes weighted averages using softmax-renormalized probabilities over the conformal sets. This uncertainty-aware decoding improves Quadratic Weighted Kappa (QWK) by reducing high-penalty misclassifications to nearer levels. Our approach shows consistent QWK improvements of 1-3 points across different base models. In the strict track, our submission achieves QWK scores of 84.9\%(test) and 85.7\% (blind test) for sentence level, and 73.3\% for document level. For Arabic educational assessment, this enables human reviewers to focus on a handful of plausible levels, combining statistical guarantees with practical usability.
!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
Basem, Mohamed, Younes, Mohamed, Ahmed, Seif, Moustafa, Abdelrahman
We present MSAs winning system for the BAREC 2025 Shared Task on fine-grained Arabic readability assessment, achieving first place in six of six tracks. Our approach is a confidence-weighted ensemble of four complementary transformer models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT) each fine-tuned with distinct loss functions to capture diverse readability signals. To tackle severe class imbalance and data scarcity, we applied weighted training, advanced preprocessing, SAMER corpus relabeling with our strongest model, and synthetic data generation via Gemini 2.5 Flash, adding about 10,000 rare-level samples. A targeted post-processing step corrected prediction distribution skew, delivering a 6.3 percent Quadratic Weighted Kappa (QWK) gain. Our system reached 87.5 percent QWK at the sentence level and 87.4 percent at the document level, demonstrating the power of model and loss diversity, confidence-informed fusion, and intelligent augmentation for robust Arabic readability prediction.
A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment
Elmadani, Khalid N., Habash, Nizar, Taha-Thomure, Hanada
This paper introduces the Balanced Arabic Readability Evaluation Corpus BAREC, a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 68,182 sentences spanning 1+ million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.3%, reflecting a high level of substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we will make BAREC openly available, along with detailed annotation guidelines and benchmark results.
A study of Vietnamese readability assessing through semantic and statistical features
Le, Hung Tuan, To, Long Truong, Nguyen, Manh Trong, Nguyen, Quyen, Do, Trong-Hop
Determining the difficulty of a text involves assessing various textual features that may impact the reader's text comprehension, yet current research in Vietnamese has only focused on statistical features. This paper introduces a new approach that integrates statistical and semantic approaches to assessing text readability. Our research utilized three distinct datasets: the Vietnamese Text Readability Dataset (ViRead), OneStopEnglish, and RACE, with the latter two translated into Vietnamese. Advanced semantic analysis methods were employed for the semantic aspect using state-of-the-art language models such as PhoBERT, ViDeBERTa, and ViBERT. In addition, statistical methods were incorporated to extract syntactic and lexical features of the text. We conducted experiments using various machine learning models, including Support Vector Machine (SVM), Random Forest, and Extra Trees and evaluated their performance using accuracy and F1 score metrics. Our results indicate that a joint approach that combines semantic and statistical features significantly enhances the accuracy of readability classification compared to using each method in isolation. The current study emphasizes the importance of considering both statistical and semantic aspects for a more accurate assessment of text difficulty in Vietnamese. This contribution to the field provides insights into the adaptability of advanced language models in the context of Vietnamese text readability. It lays the groundwork for future research in this area.
Strategies for Arabic Readability Modeling
Liberato, Juan Piñeros, Alhafni, Bashar, Khalil, Muhamed Al, Habash, Nizar
Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility. However, Arabic readability assessment is a challenging task due to Arabic's morphological richness and limited readability resources. In this paper, we present a set of experimental results on Arabic readability assessment using a diverse range of approaches, from rule-based methods to Arabic pretrained language models. We report our results on a newly created corpus at different textual granularity levels (words and sentence fragments). Our results show that combining different techniques yields the best results, achieving an overall macro F1 score of 86.7 at the word level and 87.9 at the fragment level on a blind test set. We make our code, data, and pretrained models publicly available.
An Open Multilingual System for Scoring Readability of Wikipedia
Trokhymovych, Mykola, Sen, Indira, Gerlach, Martin
With over 60M articles, Wikipedia has become the largest platform for open and freely accessible knowledge. While it has more than 15B monthly visits, its content is believed to be inaccessible to many readers due to the lack of readability of its text. However, previous investigations of the readability of Wikipedia have been restricted to English only, and there are currently no systems supporting the automatic readability assessment of the 300+ languages in Wikipedia. To bridge this gap, we develop a multilingual model to score the readability of Wikipedia articles. To train and evaluate this model, we create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online children encyclopedias. We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages and improving upon previous benchmarks. These results demonstrate the applicability of the model at scale for languages in which there is no ground-truth data available for model fine-tuning. Furthermore, we provide the first overview on the state of readability in Wikipedia beyond English.
InterEvo-TR: Interactive Evolutionary Test Generation With Readability Assessment
Delgado-Pérez, Pedro, Ramírez, Aurora, Valle-Gómez, Kevin J., Medina-Bulo, Inmaculada, Romero, José Raúl
Automated test case generation has proven to be useful to reduce the usually high expenses of software testing. However, several studies have also noted the skepticism of testers regarding the comprehension of generated test suites when compared to manually designed ones. This fact suggests that involving testers in the test generation process could be helpful to increase their acceptance of automatically-produced test suites. In this paper, we propose incorporating interactive readability assessments made by a tester into EvoSuite, a widely-known evolutionary test generation tool. Our approach, InterEvo-TR, interacts with the tester at different moments during the search and shows different test cases covering the same coverage target for their subjective evaluation. The design of such an interactive approach involves a schedule of interaction, a method to diversify the selected targets, a plan to save and handle the readability values, and some mechanisms to customize the level of engagement in the revision, among other aspects. To analyze the potential and practicability of our proposal, we conduct a controlled experiment in which 39 participants, including academics, professional developers, and student collaborators, interact with InterEvo-TR. Our results show that the strategy to select and present intermediate results is effective for the purpose of readability assessment. Furthermore, the participants' actions and responses to a questionnaire allowed us to analyze the aspects influencing test code readability and the benefits and limitations of an interactive approach in the context of test case generation, paving the way for future developments based on interactivity.